Abstract

Large-scale stochastic linear programs can be solved efficiently by blending
classical Benders decomposition with a relatively new technique called
importance sampling. This paper demonstrates how such an approach can be
implemented effectively on a parallel (hypercube) multicomputer. Numerical
results are presented.

1. Hypercube Multicomputers

Advances in VLSI (very large-scale integration) for digital circuit design are
leading to much less expensive and much smaller computers. They have also made
it possible to build a variety of “supercomputers” consisting of many small
computers combined into an array of concurrent processors. We shall refer to
such architectures as multicomputers. Each individual processor is called a
node. At this writing, multicomputers with up to 128 nodes are commercially
available from at least half a dozen manufacturers. Typically, the nodes are of
the same kind as those used in high-end microcomputers and are relatively
inexpensive. Significant computational power can be obtained by making many of
them work in parallel, at a cost much lower than that of an equivalent single
processor. The effectiveness of the approach depends on whether an application
can be reduced to a well-balanced distribution of asynchronous tasks on the
nodes. Linear programs, and especially stochastic linear programs solved by
decomposition, fit naturally into this framework.


A hypercube multicomputer is essentially a network of $2^n$ processors
interconnected in a binary $n$-cube (or hypercube) topology. The connections for
$n \le 4$ are illustrated in Figure 1. Each processor (or node) has its own
local memory and runs asynchronously of the others. Communication is done by
means of messages. A node can communicate directly with its $n$ neighbors.
Messages to more distant nodes are routed through intermediate nodes. The
hypercube topology provides an efficient balance between the costs of connection
and the benefits of direct linkages. Usually, a host computer serves as an
administrative console and as a gateway to the hypercube for users.
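The binary n-cube wiring can be illustrated with a few lines of Python. This is a minimal sketch, assuming the usual convention (not stated in the text) that nodes are numbered 0 to 2^n - 1 and that two nodes are linked exactly when their binary addresses differ in one bit. The function names are illustrative only.

```python
def neighbors(node: int, n: int) -> list[int]:
    """Nodes directly linked to `node` in a binary n-cube: flip one address bit."""
    return [node ^ (1 << bit) for bit in range(n)]


def hops(src: int, dst: int) -> int:
    """Minimum number of links between two nodes (Hamming distance of addresses)."""
    return bin(src ^ dst).count("1")


if __name__ == "__main__":
    n = 3                       # a dimension-3 cube has 2**3 = 8 nodes
    print(neighbors(0, n))      # nodes one link away from node 0
    print(hops(0, 2**n - 1))    # opposite corners are n links apart
```

Under this numbering every node has exactly n direct neighbors, and no two nodes are more than n links apart, which is the balance between wiring cost and direct linkage mentioned above.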


For the work reported in this paper, we used an Intel iPSC/2 d6 with 64 nodes
at the Oak Ridge National Laboratory. Each node consists of Intel’s 32-bit 80386
CPU (4 MIPS) coupled with an 80387 (300 Kflops) numeric coprocessor for floating
point acceleration. It has 4 MBytes of local memory. The hypercube (or Cube)
is accessed via a host (or System Resource Manager), which is also an
80386-based system with 8 MBytes of memory and a 140 MByte hard disk. The
operating system on the host is UNIX System V/386 (Release 3.0). The data
transfer rate between the System Resource Manager and the Cube has a peak value
of 2800 KBytes/sec.

Although the nodes are physically connected along the edges of a hypercube, a
trademarked routing network called DIRECT-CONNECT provides essentially uniform
communication linkages between all the nodes. The earlier “store and forward”
method used in first-generation hypercubes is replaced by a hardware switching
system, the Direct-Connect Module (DCR) on each node. Each DCR provides seven
full-duplex channels for internodal communication and one for connection to the
System Resource Manager or I/O devices. The network uses a special algorithm
for messages longer than 100 bytes. It first sends ahead a header message to the
destination node. This header sets gates in each DCR on the intermediate nodes
to clear a data path for the message. Once communication with the destination
node is established, with acknowledgment of receipt of the header, the message
is sent through at essentially hardware data transfer rates. The implication of
this improved technology is that computational efficiency is essentially
independent of how the problem domain is mapped onto the machine topology. The
hypercube can be programmed as an ensemble of processors with an arbitrary
communications network in which each node can communicate more or less uniformly
with all other nodes.

The host machine allows the user to perform the following tasks.

-  To  edit,  compile  and  link  host/node  programs.

-  To  access  and  release  the  cube  (or  a  partition  thereof). 

-  To  execute  the  host  program. 

-  To  start  or  kill  processes  on  the  cube. 

Operations peculiar to the hypercube are controlled either by UNIX-type commands
(iPSC/2 commands) or by extensions to standard programming languages such as
Fortran and C (iPSC/2 routines). The iPSC/2 commands are used to gain access
to the cube, to load, start or kill cube processes and to relinquish access to
the cube. These commands may be input from a terminal or they may be invoked
using a shell script. The iPSC/2 routines, on the other hand, are mainly used to
manage internodal messages. Nevertheless, it should be noted that almost all of
the tasks that can be performed with iPSC/2 commands can also be accomplished
from within the user programs by iPSC/2 routines with similar names. The iPSC/2
commands and routines for the Fortran programming environment are documented
in Intel (1988a).

To  execute  a  typical  parallel  program,  the  following  steps  are  used. 

I - Compile and link the host and node programs to create executable modules.

II - Obtain a partition of the cube (a subcube) of suitable size by invoking the
GETCUBE command. The user has the option of providing a name to
identify this partition. For example, the command
“getcube -c sugar -t d3”

allocates to the user an exclusive subcube named sugar with dimension 3
(i.e. 8 processors) identified by the node numbers 0, 1, 2, ..., 7.

III - Run the host program by invoking the name of the executable host module.
Node programs are loaded onto the appropriate nodes at runtime in
response to calls to the LOAD subroutine in the host program.

IV - On termination, kill all node processes and flush messages by invoking
the KILLCUBE command.

V - Relinquish access to the subcube by the RELCUBE command.

Internodal and host-to-node communication is done by subroutine calls in the
corresponding programs. The subroutine to send messages is called CSEND. Its
arguments are:


-  message  type  (ID) 

-  message  location  (address) 

-  message  length  in  bytes 

-  destination  node  ID 

-  destination  process  ID. 

The  subroutine  to  receive  messages  is  called  CRECV.  Its  arguments  are: 

-  message  type  (ID) 

-  address  of  buffer  for  storing  message 

-  length  of  buffer  in  bytes. 

Both  CSEND  and  CRECV  are  blocking  commands  in  the  sense  that  the  calling 
process  halts  until  the  message  has  been  transmitted  and  received,  respectively. 
Non-blocking  versions  of  these  commands  are  also  provided  as  ISEND  and  IRECV 
respectively.  Other  features  necessary  for  our  purpose  are  the  following  functions: 

-  IPROBE  (  ):  indicating  whether  a  message  of  a  particular  type  has  been 
received; 

-  MYHOST  (  ):  indicating  the  node  ID  of  the  host; 

-  MCLOCK  (  ):  returning  elapsed  times  on  the  nodes  and  CPU  times  on 
the  host;  and 

-  MSGWAIT  (  ):  blocking  the  calling  process  until  the  outgoing  message 
has  been  copied  to  the  operating  system  buffer. 
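The message-passing style described above (messages tagged with an integer type, a blocking receive, and a non-blocking probe) can be imitated on a single machine. The sketch below is a toy analog using Python threads, not the iPSC/2 library: the class `Mailbox` and its methods are hypothetical names, and `send` here returns as soon as the message is queued rather than following CSEND's exact blocking semantics.

```python
import threading
from collections import defaultdict, deque


class Mailbox:
    """Toy per-node mailbox: messages are queued by integer message type."""

    def __init__(self):
        self._queues = defaultdict(deque)
        self._cond = threading.Condition()

    def send(self, msg_type, payload):
        """Deposit a message and wake any receiver waiting for it."""
        with self._cond:
            self._queues[msg_type].append(payload)
            self._cond.notify_all()

    def probe(self, msg_type):
        """Non-blocking check for a pending message (loosely like IPROBE)."""
        with self._cond:
            return len(self._queues[msg_type]) > 0

    def recv(self, msg_type):
        """Block until a message of this type is available (loosely like CRECV)."""
        with self._cond:
            while not self._queues[msg_type]:
                self._cond.wait()
            return self._queues[msg_type].popleft()


def demo():
    box = Mailbox()
    received = []
    worker = threading.Thread(target=lambda: received.append(box.recv(10)))
    worker.start()
    box.send(20, "other")      # wrong type: the receiver keeps waiting
    box.send(10, "x-vector")   # matching type: the receiver wakes up
    worker.join()
    return received, box.probe(20), box.probe(10)
```

Selecting messages by type is what lets a node wait for one specific kind of message while others sit in the queue untouched.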

2.  Two-Stage  Stochastic  Linear  Programs 

An important class of stochastic models is the class of two-stage stochastic
linear programs with recourse. These models are the stochastic extensions of
deterministic dynamic systems with a staircase structure: $x$ denotes the
first stage and $y$ the second stage decision variables; $A$, $b$ represent the
coefficients and right hand sides of the first stage constraints; and $D$, $d$
represent the second stage constraints, which together with the transition
matrix $B$ couple the two periods. In the literature $D$ is often referred to as
the technology/recourse matrix. The first stage parameters are known with
certainty. The second stage parameters are random variables that assume certain
outcomes $\omega$ with certain probabilities $p(\omega)$. At time $t = 1$ they
are known only by their probability distribution of possible outcomes; the
actual outcomes become known later, at time $t = 2$. Uncertainty occurs in the
transition matrix $B$ and in the right hand side vector $d$. The second stage
costs $f$ and the elements of the technology/recourse matrix $D$ are assumed to
be known with certainty. We denote an outcome of the stochastic parameters by
$\omega \in \Omega$, with $\Omega$ being the set of all possible outcomes. The
two-stage stochastic linear program can be written as follows:

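$$
\begin{aligned}
\min\ Z ={} & cx + E(f y^{\omega}) \\
\text{s/t}\quad & Ax = b \\
& -B^{\omega}x + Dy^{\omega} = d^{\omega} \\
& x,\ y^{\omega} \ge 0,\quad \omega \in \Omega,\quad p(\omega)\ \text{known}.
\end{aligned}
$$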

The problem is to find a first stage decision $x$ which is feasible for all
scenarios $\omega \in \Omega$ and has the minimum expected costs. Note the
adaptive nature of the problem: while the decision $x$ is made knowing only the
distribution $p(\omega)$ of the random parameters, the second stage decision
$y^{\omega}$ is made later, after an outcome $\omega$ is observed. The second
stage decision compensates for and adapts to the different scenarios $\omega$.

Two-stage stochastic linear programs have been studied extensively in the
literature since Dantzig (1955). For example, Birge (1985), Ermoliev (1983),
Frauendorfer (1988), Higle and Sen (1989), Kall (1979), Pereira et al. (1989),
Rockafellar and Wets (1989), Ruszczynski (1986), Wets (1984) and others have
contributed to this area (see Ermoliev and Wets (1988)). Parallel decomposition
methods for deterministic linear programs are reported, e.g., in Entriken
(1989), Ho and Gnanendran (1989) and Ho, Lee and Sundarraj (1988). Examples of
using parallel processing for solving stochastic programs are Ariyawansa and
Hudson (1990), Hiller and Eckstein (1990), Vladimirou and Mulvey (1990), Wets
(1985), and Zenios (1990).

The difficulty of solving large-scale stochastic problems arises from the need
to compute multiple integrals or multiple sums. The expected value of the second
stage costs for given first stage decision variables $x$,
$z = E(f y^{\omega}) = E(C)$, is an expectation of functions $C(v^{\omega})$,
$\omega \in \Omega$, where $C(v^{\omega})$ is obtained by solving a linear
programming problem. $V$ is an $h$-dimensional random vector parameter,
$V = (V_1, \ldots, V_h)$, with outcomes
$v^{\omega} = (v_1, \ldots, v_h)^{\omega}$. Clearly, $V$ is composed of the
random elements of the transition matrix $B$ and the random elements of the
right hand side $d$. For example, $V_i$ represents the percent of generators of
type $i$ down for repair or transmission lines not operating and $v_i^{\omega}$
the observed random percent outcome, or $V_i$ represents an uncertain
electricity demand in demand region $i$ and $v_i^{\omega}$ the observed demand
realization. We also will denote the vector $v^{\omega}$ by $v$. The
corresponding probability is denoted by $p(v^{\omega})$, sometimes $p(v)$ or
$p^{\omega}$. We assume independence of the stochastic parameters. $\Omega$, the
set of all possible random events, is constructed by crossing the sets of
outcomes $\Omega_i$, $i = 1, \ldots, h$, as
$\Omega = \Omega_1 \times \Omega_2 \times \cdots \times \Omega_h$. The
expectation $E\,C(V)$ takes on the form of a multiple integral
$E\,C(V) = \int \cdots \int C(v)\,p(v)\,dv_1 \ldots dv_h$, or, in case of
discrete distributions, the form of a multiple sum
$E\,C(V) = \sum_{v_1} \cdots \sum_{v_h} C(v)\,p(v)$, where
$p(v) = p_1(v_1) \cdots p_h(v_h)$.

In the following discussion we concentrate on discrete distributions. In this
case $\Omega$ contains $K$ scenarios. For relevant practical problems, $K$ can
get very large and easily out of hand. Consider, for instance, the number of
stochastic parameters $h$ being as small as 20 and $\Omega_i$, the set of
possible outcomes of parameter $i$, containing $K_i = 5$ possible outcomes each.
Each term requires a function evaluation which can be computationally expensive
since its value is obtained as the optimal solution of a linear program. The
number of terms in the multiple sum computation is $K = 5^{20} \approx 10^{14}$.
It is clear that the problem can no longer practically be solved by direct
summation.
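The size of the scenario set in this example is easy to check. The snippet below is a small sketch; the figure of one millisecond per linear program is a purely hypothetical assumption, used only to convey the order of magnitude.

```python
from math import prod


def scenario_count(outcomes_per_parameter):
    """Number of scenarios when independent parameters are crossed."""
    return prod(outcomes_per_parameter)


K = scenario_count([5] * 20)          # h = 20 parameters, 5 outcomes each
print(K)                              # 5**20 terms in the multiple sum

seconds_per_lp = 1e-3                 # hypothetical: 1 ms per subproblem
years = K * seconds_per_lp / (365 * 24 * 3600)
print(round(years))                   # serial time for direct summation, in years
```

Even at this optimistic solve time, direct summation would take on the order of thousands of years on one processor, which is why sampling is needed.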

Using discrete distributions, one can express a stochastic problem as a
deterministically equivalent linear program by writing down the second stage
constraints for each scenario $\omega \in \Omega$ one below the other. The
objective function carries out the expected value computation by direct
summation. Clearly, this formulation leads to linear programs of enormous sizes.

$$
\begin{aligned}
\min\ Z ={} & cx + p^{1} f y^{1} + p^{2} f y^{2} + \cdots + p^{K} f y^{K} \\
\text{s/t}\quad & Ax = b \\
& -B^{1}x + Dy^{1} = d^{1} \\
& -B^{2}x + Dy^{2} = d^{2} \\
& \qquad \vdots \\
& -B^{K}x + Dy^{K} = d^{K} \\
& x,\ y^{1}, y^{2}, \ldots, y^{K} \ge 0.
\end{aligned}
$$


The method which we apply to solve large-scale stochastic linear programs uses
Benders decomposition and importance sampling. The method and the underlying
theory of our approach are developed in Dantzig and Glynn (1990) and Infanger
(1990, 1991). Dantzig and Infanger (1991) report on the solution of large-scale
problems. Entriken and Infanger (1990) discuss how reliability constraints can
be handled by additionally using Dantzig-Wolfe decomposition. In the following
we give a brief review of the concept. Using decomposition techniques we split
the problem into a series of tractable smaller problems. Using sampling
techniques we compute an estimate of the expected costs and variances.
Importance sampling is the key to obtaining accurate estimates, i.e. unbiased
estimates with low variances, from small samples.

3.  Benders  Decomposition 

We decompose the two-stage stochastic linear program using Benders (1962)
dual decomposition. Following Van Slyke and Wets (1969), we express the sum of
second stage costs by a scalar $\theta$ and replace the second stage conditions
sequentially by “cuts”, which are necessary conditions expressed only in terms
of the first stage decision variables $x$ and $\theta$. The problem then
decomposes into a master problem and into independent subproblems, one for each
$\omega \in \Omega$. The latter are used to generate the cuts, unbiased
estimates, and variances.

The master problem:

$$
\begin{aligned}
\min\ z_M ={} & cx + \theta \\
\text{s/t}\quad & Ax = b \\
\text{cuts:}\quad & -G^{l}x + \alpha^{l}\theta \ge g^{l},\quad l = 1, \ldots, L \\
& x,\ \theta \ge 0,
\end{aligned}
$$

where the cuts are initially absent and one is added in each iteration. On
iteration $l$ the master problem is optimized to obtain an approximate optimal
feasible solution $x = x^{l}$ (using only the $l$ cuts generated so far), which
is passed as input to the subproblems. The value of the scalar $\theta$ gives an
approximation to the expected subproblem costs and $z_M^{l} = cx^{l} + \theta$
gives a lower bound estimate of $\min Z$.

The subproblems:

The solution $x^{l}$ of the master problem of iteration $l$ is sent as input to
each subproblem $\omega$, which is then solved to obtain the optimal costs of
the second stage problem: namely, for each scenario $\omega \in \Omega$ and for
given $x^{l}$, the following subproblem is solved:

$$
\begin{aligned}
\min\ z^{\omega} ={} & f y^{\omega} \\
\text{s/t}\quad & Dy^{\omega} = d^{\omega} + B^{\omega}x^{l} \\
& y^{\omega} \ge 0,\quad \omega \in \Omega,\ \text{e.g.}\ \Omega = \{1, 2, \ldots, K\}.
\end{aligned}
$$

$z^{\omega} = z^{\omega}(x^{l})$ is the optimal objective function value as a
function of $x^{l}$. The dual multipliers $\pi^{\omega} = \pi^{\omega}(x^{l})$
corresponding to the constraints in scenario $\omega$ are then used to generate
the next cut $l$ to augment the set of $l - 1$ cuts found so far for the master
problem.

The cuts (definition of $G$, $g$, $\pi^{\omega}$ for cut $l$):

$$
G^{l} = E(\pi^{\omega} B^{\omega}),\qquad
g^{l} = E(\pi^{\omega} d^{\omega}),\qquad
\pi^{\omega} = \pi^{\omega}(x^{l}).
$$

Note that if a subproblem is infeasible a different definition of the cut is
used. $\alpha^{l} = 0$ corresponds to feasibility cuts and $\alpha^{l} = 1$
corresponds to optimality cuts.

The expected value of the second stage costs:

$$
z(x^{l}) = E\big(z^{\omega}(x^{l})\big).
$$

Lower ($LB^{l}$) and upper ($UB^{l}$) bounds to the problem, with
$UB^{0} = \infty$:

$$
LB^{l} = z_M^{l},\qquad
UB^{l} = \min\{UB^{l-1},\ cx^{l} + z(x^{l})\}.
$$

The optimum objective function value $z_M^{l}$ of the master problem in
iteration $l$ provides a lower bound on the objective function value of the
optimum solution of the problem, and this lower bound increases monotonically
with $l$. The expected costs $cx^{l} + z(x^{l})$ associated with a trial
solution $x^{l}$ provide an upper bound on the optimal costs of the problem.
These upper bounds, however, do not decrease monotonically with $l$; hence we
recursively redefine the best solution as the one associated with the lowest
upper bound obtained so far.
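The interplay of master problem, subproblems, cuts and bounds can be traced on a toy instance small enough to solve without an LP code. The sketch below is an illustration under stated assumptions, not the authors' implementation: the first stage is a single variable $x$ with unit cost $c$ and a bound `x_max` standing in for $Ax = b$; the second stage is $\min f y$ subject to $y \ge d^{\omega} - x$, $y \ge 0$ (so $B^{\omega} = -1$); all scenarios are enumerated exactly rather than sampled; and only optimality cuts arise. All function names are hypothetical.

```python
def solve_sub(x, d, f):
    """Second stage for one scenario: min f*y s.t. y >= d - x, y >= 0.
    Returns the optimal cost and the dual multiplier pi of the demand row."""
    shortage = d - x
    if shortage > 0:
        return f * shortage, f
    return 0.0, 0.0


def solve_master(c, cuts, x_max):
    """min c*x + theta s.t. theta >= g + G*x per cut, theta >= 0, 0 <= x <= x_max.
    The objective is convex piecewise linear in x, so checking breakpoints suffices."""
    lines = [(0.0, 0.0)] + cuts              # (G, g) pairs; (0, 0) encodes theta >= 0
    candidates = {0.0, float(x_max)}
    for i, (G1, g1) in enumerate(lines):
        for G2, g2 in lines[i + 1:]:
            if G1 != G2:
                x = (g2 - g1) / (G1 - G2)    # where the two lines intersect
                if 0.0 <= x <= x_max:
                    candidates.add(x)

    def theta(x):
        return max(g + G * x for G, g in lines)

    x_best = min(candidates, key=lambda x: (c * x + theta(x), x))
    return x_best, theta(x_best)


def benders(c, f, demands, probs, x_max, tol=1e-9, max_iter=50):
    cuts, upper, best_x, lower = [], float("inf"), None, float("-inf")
    for _ in range(max_iter):
        x, theta = solve_master(c, cuts, x_max)
        lower = c * x + theta                               # LB^l = z_M^l
        results = [solve_sub(x, d, f) for d in demands]
        z = sum(p * cost for p, (cost, _) in zip(probs, results))
        if c * x + z < upper:                               # UB^l = min{UB^(l-1), c x + z(x)}
            upper, best_x = c * x + z, x
        if upper - lower <= tol:
            break
        # With B = -1: G = E(pi*B) = -E(pi), g = E(pi*d); the cut reads theta >= g + G*x.
        G = -sum(p * pi for p, (_, pi) in zip(probs, results))
        g = sum(p * pi * d for p, (_, pi), d in zip(probs, results, demands))
        cuts.append((G, g))
    return best_x, lower, upper


if __name__ == "__main__":
    # Analytic optimum of this instance: x = 4 with expected total cost 5.8.
    print(benders(c=1.0, f=3.0, demands=[2.0, 4.0, 6.0],
                  probs=[0.2, 0.5, 0.3], x_max=10.0))
```

The lower bound rises as cuts accumulate while the recorded upper bound is the best trial solution seen so far, mirroring the definitions of $LB^{l}$ and $UB^{l}$.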

Computing these expectations exactly can in practice be an impossible task.
Solving all subproblems $\omega \in \Omega$ once they are seeded with a trial
solution $x^{l}$ from the master problem means total evaluation for all
$\omega$, and these can be too many. Instead we use a specialized Monte Carlo
sampling technique to select a sample of subproblems $\omega$, $\omega \in S$,
to compute estimates of the second stage costs $z^{l}$ and estimates of the
gradients $G^{l}$ and right hand sides $g^{l}$ of the cuts. Using estimates for
the gradient and the right hand side of the cuts, and estimates of the second
stage costs, we obtain estimates of the lower and upper bounds to the problem.
These lower and upper bound estimates are viewed as sample means drawn from a
population of i.i.d. random terms. If the sample size is forty or more the
sample means for all practical purposes can be assumed to be normally
distributed. The estimation process also provides estimates of the variances of
these sample means. A 95% confidence interval for the objective of the obtained
optimal solution is computed. A Student-t test is used to test whether the lower
and upper bounds of the objective are sufficiently close. If yes, the problem is
considered solved and the iterative process terminated.
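As a rough illustration of the interval computation, the sketch below forms a sample mean, its standard error, and a two-sided 95% interval using the normal approximation mentioned above. It is not the authors' estimation procedure; the function name and the use of the quantile 1.96 are assumptions made for the example.

```python
import math
import statistics


def mean_with_ci(values, z_score=1.96):
    """Sample mean and normal-approximation confidence interval (95% by default)."""
    n = len(values)
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)   # estimated std. dev. of the mean
    return mean, mean - z_score * std_err, mean + z_score * std_err


if __name__ == "__main__":
    sample = [10.0] * 20 + [12.0] * 20     # 40 observed costs, purely illustrative
    print(mean_with_ci(sample))
```

Because the standard error shrinks like $n^{-1/2}$, halving the width of the interval requires roughly four times as many sampled subproblems.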

4.  Importance  Sampling 

Monte Carlo methods are recommended by numerical analysts for computing
multiple integrals or multiple sums of functions of $v$, where $v$ is a point in
a space of higher dimension $h$, say $h > 4$ (Davis and Rabinowitz (1984)).
Suppose $C^{\omega} = C(v^{\omega})$ are independent random variates of
$v^{\omega}$, $\omega = 1, \ldots, n$, with expectation $z$, where $n$ is the
sample size. An unbiased estimator of $z$ is

$$
\bar{z} = \frac{1}{n} \sum_{\omega=1}^{n} C^{\omega},
$$

with variance $\sigma_{\bar{z}}^{2} = \sigma^{2}/n$,
$\sigma^{2} = \mathrm{var}(C(V))$. Note that the standard error decreases with
$n^{-0.5}$ and the convergence rate of $\bar{z}$ to $z$ is independent of the
dimension $h$ of the sample space. Note also the inherent parallelism of the
approach. The $C^{\omega}$ are random variates obtained by solving a linear
program. The computation of the $C^{\omega}$ can be carried out in parallel:
different sample problems are assigned to different processors and solved
concurrently. As the computation of $C^{\omega}$ is computationally the most
expensive part of the Monte Carlo scheme, the parallel implementation can be
anticipated to be highly efficient, an anticipation which is borne out by the
experimental results to be presented later.
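The structure of this estimator, with independent evaluations farmed out to workers, can be sketched as follows. This is an illustration only: the function `cost` is a stand-in for the linear program that defines $C(v)$, and a thread pool on one machine replaces the hypercube nodes. All names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def cost(v):
    """Stand-in for C(v); in the paper each evaluation is a linear program."""
    return max(0.0, 10.0 - sum(v))


def crude_monte_carlo(outcomes, probs, n, workers=4, seed=0):
    """Estimate E C(V) and its standard error from n independently drawn scenarios."""
    rng = random.Random(seed)
    # Draw each component of v independently from its own discrete distribution.
    sample = [tuple(rng.choices(o, weights=p)[0] for o, p in zip(outcomes, probs))
              for _ in range(n)]
    # The evaluations do not interact, so they can be assigned to separate workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        costs = list(pool.map(cost, sample))
    mean = sum(costs) / n
    variance = sum((c - mean) ** 2 for c in costs) / (n - 1)
    return mean, (variance / n) ** 0.5


if __name__ == "__main__":
    outcomes = [(0, 2, 4), (1, 3), (0, 5)]
    probs = [(0.2, 0.5, 0.3), (0.6, 0.4), (0.7, 0.3)]
    print(crude_monte_carlo(outcomes, probs, n=200))
```

Only the sampled scenario goes to a worker and only a number comes back, which is why the scheme parallelizes with little communication.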

Importance sampling is a variance reduction technique often applied in
simulation models (Glynn and Iglehart (1989)). We rewrite
$z = \sum_{\omega \in \Omega} C(v^{\omega})\,p(v^{\omega})$ as

$$
z = \sum_{\omega \in \Omega}
\frac{C(v^{\omega})\,p(v^{\omega})}{q(v^{\omega})}\, q(v^{\omega})
$$

by introducing a new probability mass function $q(v^{\omega})$, and we obtain a
new estimator for $z$,

$$
\bar{z} = \frac{1}{n} \sum_{\omega=1}^{n}
\frac{C(v^{\omega})\,p(v^{\omega})}{q(v^{\omega})},
$$

by sampling from the distribution $q(v^{\omega})$. The variance of $\bar{z}$ is
given by

$$
\mathrm{var}(\bar{z}) = \frac{1}{n} \sum_{\omega \in \Omega}
\left( \frac{C(v^{\omega})\,p(v^{\omega})}{q(v^{\omega})} - z \right)^{2}
q(v^{\omega}).
$$

Choosing
$q^{*}(v^{\omega}) = C(v^{\omega})\,p(v^{\omega}) \big/ \sum_{\omega \in \Omega} C(v^{\omega})\,p(v^{\omega})$
would lead to $\mathrm{var}(\bar{z}) = 0$, which means one could get a perfect
estimate of the multiple sum from a sample of size $n = 1$. However, this is a
useless result, since to compute $q^{*}(v^{\omega})$ we would need to know
$z = \sum_{\omega \in \Omega} C(v^{\omega})\,p(v^{\omega})$, which is what we
wanted to compute in the first place. Nevertheless, this suggests the following
heuristic for choosing $q$. It should be proportional to the product
$C(v^{\omega})\,p(v^{\omega})$ and should be of such a form that it can be
integrated easily. Thus a function $\Gamma(v^{\omega}) \approx C(v^{\omega})$ is
sought which can be integrated with less effort than $C(v^{\omega})$. Additive
and multiplicative (in the components of the stochastic vector $v$)
approximation functions and combinations of these are candidate approximation
functions. In particular, we have been getting good results using as our
approximation to $C(v)$ a function of the form $\sum_{i=1}^{h} C_i(v_i)$, where
$C_i(v_i)$ is (in general) a non-linear function of a scalar variable $v_i$. We
compute $q$ as

$$
q(v^{\omega}) =
\frac{\sum_{i=1}^{h} C_i(v_i^{\omega})\; p(v^{\omega})}
{\sum_{i=1}^{h} \sum_{\omega \in \Omega_i} C_i(v_i^{\omega})\,p_i(v_i^{\omega})}.
$$
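The zero-variance property of the ideal density $q^{*} \propto C\,p$ can be checked numerically on a tiny discrete example. The sketch below uses made-up costs and probabilities; the function name is illustrative.

```python
import random


def importance_estimate(cost, p, q, n, seed=0):
    """Estimate z = sum_w cost[w] * p[w] from n scenarios drawn from q."""
    rng = random.Random(seed)
    scenarios = range(len(p))
    draws = rng.choices(scenarios, weights=q, k=n)
    # Each draw contributes C(w) * p(w) / q(w), which has expectation z under q.
    return sum(cost[w] * p[w] / q[w] for w in draws) / n


if __name__ == "__main__":
    cost = [1.0, 4.0, 10.0]                 # illustrative scenario costs
    p = [0.5, 0.3, 0.2]                     # original probabilities
    z = sum(c * pw for c, pw in zip(cost, p))
    q_star = [c * pw / z for c, pw in zip(cost, p)]
    print(z, importance_estimate(cost, p, q_star, n=1))   # one draw already recovers z
    print(importance_estimate(cost, p, p, n=1))           # crude sampling: a single cost
```

With $q^{*}$ every term equals $z$, so the sample size is irrelevant; with $q = p$ the estimator reduces to crude Monte Carlo. A practical $q$ sits between these extremes.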

To understand the motivation for the importance sampling scheme, assume for
convenience $C_i(v_i^{\omega}) > 0$ and let
$\Gamma(v^{\omega}) = \sum_{i=1}^{h} C_i(v_i^{\omega})$. If
$\sum_{\omega} \Gamma(v^{\omega})\,p(v^{\omega})$ were used as an approximation
of $z$, it can be written

$$
\sum_{i=1}^{h} \alpha_i \sum_{\omega=1}^{K}
\frac{C_i(v_i^{\omega})\,p_i(v_i^{\omega})}{\alpha_i}
\prod_{j \ne i} p_j(v_j^{\omega}),
$$

where $\omega = (\omega_1, \omega_2, \ldots, \omega_h)$ and where we define

$$
\alpha_i = \sum_{\omega_i \in \Omega_i} C_i(v_i^{\omega_i})\,p_i(v_i^{\omega_i}),
$$

which is relatively easy to compute since it can be evaluated by summing over
only one of the dimensions of $\omega$. Note that

$$
q_i(v_i^{\omega}) = \frac{C_i(v_i^{\omega})\,p_i(v_i^{\omega})}{\alpha_i} \ge 0,
\qquad \omega_i \in \Omega_i,
$$

may be viewed as a modified probability distribution of $V_i$ associated with
the $i$-th term. It is, of course, a trivial matter to directly sum each term
$i$ since each of its factors, being independent probability distributions, sums
to one. Suppose, however, one does not notice this fact and decides to estimate
the sum by estimating each of the $h$ terms by Monte Carlo sampling. The $i$-th
term would then be evaluated by randomly sampling $V_i$ from the distribution
$q_i(v_i^{\omega})$ and all the rest of the components $v_j$ of $v$ from the
distributions $p_j(v_j^{\omega})$.

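In an analogous manner, we let

$$
\rho(\omega) = \frac{C(\omega)}{\Gamma(\omega)}
$$

and write

$$
\begin{aligned}
z &= \sum_{\omega} C(\omega)\,p(\omega)
   = \sum_{\omega} \rho(\omega)\,\Gamma(\omega)\,p(\omega) \\
  &= \sum_{i=1}^{h} \alpha_i \sum_{\omega=1}^{K} \rho(\omega)\,
     q_i(v_i^{\omega}) \prod_{j \ne i} p_j(v_j^{\omega}).
\end{aligned}
$$

If our approximation $\Gamma(\omega)$ to $C(\omega)$ is any good, $\rho(\omega)$
will be roughly 1 for almost all $\omega$. This suggests the heuristic that the
sampling be carried out differently for each term $i$. The importance sampling
scheme then is to sample $V_i$ of the $i$-th term according to the distribution
$q_i(v_i^{\omega})$ and to sample all other components $v_j^{\omega}$ of the
$i$-th term according to the distributions $p_j(v_j^{\omega})$.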

If the additive function turns out to be a bad approximation of the cost
function, as indicated by the observed variance being too high, this is easily
corrected by increasing the size of the sample.

Actually we use a variant of the additive approximation function. By
introducing $C(\tau)$, the costs of a base case, we make the model more
sensitive to the impact of the stochastic parameters $v$. Our approximation
function is computed as follows:

$$
\Gamma(V) = C(\tau) + \sum_{i=1}^{h} \Gamma_i(V_i),\qquad
\Gamma_i(V_i) = C(\tau_1, \ldots, \tau_{i-1}, V_i, \tau_{i+1}, \ldots, \tau_h) - C(\tau).
$$

We refer to this as a marginal cost approximation. We explore the cost function
at the margins, i.e. we vary the random element $V_i$ to compute the costs for
all outcomes $v_i$ while we fix the other random elements at the level of the
base case. $\tau$ can be any arbitrarily chosen point of the set of $k_i$
discrete values of $V_i$, $i = 1, \ldots, h$. For example, we choose $\tau_i$ as
that outcome of $V_i$ which leads to the lowest costs, ceteris paribus.
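The marginal cost approximation can be tabulated with one cost evaluation for the base case and one for each non-base outcome of each parameter. The sketch below assumes a hypothetical `cost` function in place of the linear program and assumes the outcomes of each parameter are distinct; names are illustrative.

```python
def marginal_approximation(cost, outcomes, base):
    """Return C(tau), the tables gamma[i][k] = C(tau with V_i = outcome k) - C(tau),
    and the number of cost evaluations used."""
    base_cost = cost(base)
    evaluations = 1
    gamma = []
    for i, values in enumerate(outcomes):
        row = []
        for v in values:
            if v == base[i]:
                row.append(0.0)                      # the base outcome adds nothing
            else:
                trial = base[:i] + (v,) + base[i + 1:]
                row.append(cost(trial) - base_cost)  # vary V_i only, others at base
                evaluations += 1
        gamma.append(row)
    return base_cost, gamma, evaluations


if __name__ == "__main__":
    def cost(v):                                     # stand-in for an LP solve
        return 5 + 2 * v[0] + v[1] ** 2 + v[0] * v[2]

    outcomes = [(0, 1, 2), (0, 1), (0, 3)]
    print(marginal_approximation(cost, outcomes, base=(0, 0, 0)))
```

In this example the interaction term `v[0] * v[2]` is invisible at the margins, which is the kind of approximation error that shows up later as sampling variance.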

Summarizing, the importance sampling scheme has two phases: the preparation
phase and the sample phase. In the preparation phase we explore the cost
function $C(V)$ at the margins to compute the additive approximation function
$\Gamma(V)$. For this process $n_{prep} = 1 + \sum_{i=1}^{h} (k_i - 1)$
subproblems have to be solved. Using $\Gamma(V)$ we compute the approximate
importance density

$$
q(v^{\omega}) =
\frac{\Gamma(v^{\omega})\,p(v^{\omega})}
{C(\tau) + \sum_{i=1}^{h} \sum_{\omega \in \Omega_i} \Gamma_i(v_i^{\omega})\,p_i(v_i^{\omega})}.
$$

Next we sample $n$ scenarios from the importance density and, in the sample
phase, solve $n$ linear programs to compute the estimate of $z$ using the Monte
Carlo estimator. We compute the gradient $G$ and the right hand side $g$ of the
cut using the same sample points at hand from the expected cost calculation. See
Infanger (1990, 1991) for the computation of the cuts and details of the
estimation process. The function evaluations in the preparation phase and the
sample phase are “made to order” for parallel processing.
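One way to draw scenarios from a density proportional to $\Gamma(v)\,p(v)$ is to treat it as a mixture: pick the base term with weight $C(\tau)$ or term $i$ with weight $\alpha_i = \sum \Gamma_i\,p_i$, then sample $V_i$ from the tilted distribution and all other components from their original ones. The sketch below follows that reading and assumes $C(\tau) > 0$ and $\Gamma_i \ge 0$ (as when $\tau$ is the lowest-cost outcome). It is an illustration, not the authors' code; scenarios are represented by index tuples and all names are hypothetical.

```python
import random


def sample_scenario(rng, base_cost, gamma, probs):
    """Draw an index vector from q(v) proportional to Gamma(v) * p(v)."""
    h = len(probs)
    alphas = [sum(g * p for g, p in zip(gamma[i], probs[i])) for i in range(h)]
    term = rng.choices(range(h + 1), weights=[base_cost] + alphas)[0]  # 0 = base term
    v = []
    for i in range(h):
        weights = probs[i]
        if term == i + 1:                       # tilt only the component of the chosen term
            weights = [g * p for g, p in zip(gamma[i], probs[i])]
        v.append(rng.choices(range(len(probs[i])), weights=weights)[0])
    return tuple(v)


def importance_estimate(cost, base_cost, gamma, probs, n, seed=0):
    """Estimate z = E C(V) as the average of C(v) p(v) / q(v) over n draws from q."""
    rng = random.Random(seed)
    norm = base_cost + sum(sum(g * p for g, p in zip(gamma[i], probs[i]))
                           for i in range(len(probs)))
    total = 0.0
    for _ in range(n):
        v = sample_scenario(rng, base_cost, gamma, probs)
        approx = base_cost + sum(gamma[i][k] for i, k in enumerate(v))  # Gamma(v)
        total += cost(v) / approx * norm        # equals C(v) p(v) / q(v)
    return total / n
```

If the cost function happens to be exactly additive, every sampled term equals the normalizing constant and the estimate is exact for any sample size; otherwise the ratio $C/\Gamma$ fluctuates around 1 and determines the variance.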

5.  The  parallel  algorithm 

The hypercube computer has the architecture of loosely coupled multiprocessors.
The nodes of the cube are independent processors, where each processor has
its own operating system and its own memory. The nodes are connected via a
communication network. Information is exchanged between nodes only by sending
messages. The hypercube architecture defines which nodes are directly connected
and which nodes are only indirectly connected via third nodes. Message routing
systems of modern hypercube computers, like the Intel iPSC/2 computer that we
are using, ensure that communication between indirectly connected nodes is very
fast. Thus the difference in communication time between directly and indirectly
connected nodes is negligible. However, the time spent on communication can be
significant if much information is exchanged between nodes. Therefore a parallel
algorithm for loosely coupled multiprocessors should be designed in such a way
that only minimum amounts of information have to be exchanged between nodes.

The main work is in the repeated solving of the master problem and of the
subproblems in the preparation phase and in the sample phase. All other tasks
are comparatively unimportant with respect to computing time. We assign
processor 0 to be the master processor. Besides its main task of solving the
master problem, the master processor also controls the computation and
synchronizes the algorithm. The other processors (1 - 63) are assigned to be
subprocessors, with the main task of solving subproblems. This design requires
communication between the master processor and the subprocessors. No information
needs to be exchanged between different subprocessors.

In addition there is a host processor which has access to data storage devices
and manages data input and output. The execution of the parallel program follows
these general steps. The host processor loads the host module (the executable
file for the host processor) into its memory and starts the execution. Next
the executable files for the master processor and the subprocessors are loaded
into the host and then sent to the master processor and the subprocessors
respectively. The master processor and the subprocessors start execution after
they receive their modules. After processing the input data and sending it to
the master and the subprocessors, the host remains inactive and waits until it
receives the optimal solution from the master processor. During this time the
algorithm is performed entirely in the cube and the master processor controls
the execution of the program. After receiving the optimal solution, the host
processor outputs the solution to the disk, stops the execution of the programs
of the master and the subprocessors, and releases the cube, terminating the
parallel program.

The problem data include the problem specification of the master and the
subproblem, the stochastic information and control parameters for the execution
of the program. The input data for specifying the master problem and the
subproblem are given in the form of an MPS file. Internally the problems are
stored in the form of the data structures used by the linear programming solver,
which we use as a subroutine. We adapted LPM1 (Tomlin 1973), a linear
programming optimizer, for our purposes. Clearly, the master processor only
receives the data for the master problem and the subprocessors only get the data
for the subproblem. Thus no switching between different problems is necessary,
as it would be in a serial implementation. Both master and subprocessors receive
the complete stochastic information. The stochastic data include the
identification of the stochastic parameters within the problem and their
distributions.

An index vector $\nu^\omega = (\nu_1, \ldots, \nu_h)^\omega$ completely defines a scenario $\omega$, where
$\nu_i \in \{1, \ldots, k_i\}$, $i = 1, \ldots, h$, is the index of the outcome of random
parameter $i$. For example, $\nu^\omega = (1,3,2)$ would denote
a scenario given by the first outcome of random parameter 1, the third outcome of
random parameter 2 and the second outcome of random parameter 3. Thus only
the index vector $\nu^\omega$ is transmitted from the master processor to a sub processor
to identify the scenario subproblem to be solved. For example, for $h = 20$
and a 4-byte integer representation, 80 bytes have to be sent. Besides the scenario
information $\nu^\omega$, the current solution of the master problem, $x^l$, is needed to set up
the scenario problem $\omega$. We pass $x^l$ to each subprocessor $j$ only once per iteration, at
the beginning of the preparation phase. The flag $I_x \in \{0, 1\}$ tells the subprocessor
whether an $x$ has to be received (1) or not (0).
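The message size quoted above can be illustrated with a minimal Python sketch. The function names and the outcome values are hypothetical and serve only as an illustration; the sketch assumes 4-byte integers and 1-based outcome indices as in the text.

```python
import struct

def pack_scenario(nu):
    """Encode a scenario index vector as consecutive 4-byte integers."""
    return struct.pack(f"<{len(nu)}i", *nu)

def unpack_scenario(payload):
    """Decode the index vector on the receiving subprocessor."""
    h = len(payload) // 4
    return list(struct.unpack(f"<{h}i", payload))

def lookup_outcomes(nu, outcomes):
    """outcomes[i] lists the possible values of random parameter i + 1;
    the indices in nu are 1-based, as in the text."""
    return [outcomes[i][k - 1] for i, k in enumerate(nu)]

nu = [1, 3, 2]
msg = pack_scenario(nu)
outcomes = [[0.9, 1.0], [10.0, 20.0, 30.0], [0.5, 0.7]]  # hypothetical values
print(len(msg), unpack_scenario(msg), lookup_outcomes(nu, outcomes))
print(len(pack_scenario([1] * 20)))  # message length in bytes for h = 20
```

Because every subprocessor already holds the complete stochastic data, the index vector alone is enough to rebuild the scenario locally.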

Now subprocessor $j$ looks up the outcomes of the stochastic parameters corresponding
to $\nu$ to set up the vector $b(\nu)$ and the matrix $B(\nu)$. Using $x$, the right
hand side $b(\nu) + B(\nu)x$ is computed and the subproblem is solved.
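As a small illustration of this step, the right hand side can be formed as follows. This is a sketch with dense Python lists and made-up numbers; the actual implementation works on the data structures of the LP solver.

```python
def scenario_rhs(b_nu, B_nu, x):
    """Return b(nu) + B(nu) x, with B(nu) stored as a list of rows."""
    return [b_i + sum(B_ij * x_j for B_ij, x_j in zip(row, x))
            for b_i, row in zip(b_nu, B_nu)]

b_nu = [4.0, 1.0]                 # hypothetical b(nu)
B_nu = [[1.0, 0.0], [0.5, 2.0]]   # hypothetical B(nu)
x = [2.0, 3.0]                    # hypothetical master solution
rhs = scenario_rhs(b_nu, B_nu, x)
print(rhs)  # [6.0, 8.0]
```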

In any case the optimal objective function value $z(\nu)$ has to be sent to the
master processor. Dual information for the coefficients $G$ and the right hand side $g$
of the cut is needed only from the base case scenario and all sample scenarios.
In this case we compute the products $G(\nu) = \pi(\nu) B(\nu)$ and $g(\nu) = \pi(\nu) b(\nu)$,
where $\pi(\nu)$ denotes the optimal dual solution of the subproblem, and
send the result to the master processor. The flag $I_c$ tells the subprocessor whether the
computation and the sending of $G(\nu)$ and $g(\nu)$ is requested (1) or not (0).
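The two products can be sketched in a few lines of Python. The dual vector and the data below are made-up numbers, and `cut_terms` is a hypothetical helper name; in the implementation the duals come from the LP solver.

```python
def cut_terms(pi, B_nu, b_nu):
    """Return G(nu) = pi B(nu) (a row vector) and g(nu) = pi b(nu) (a scalar)."""
    n_cols = len(B_nu[0])
    G = [sum(pi[i] * B_nu[i][j] for i in range(len(pi))) for j in range(n_cols)]
    g = sum(p_i * b_i for p_i, b_i in zip(pi, b_nu))
    return G, g

pi = [0.5, 2.0]                   # hypothetical dual solution
B_nu = [[1.0, 0.0], [0.5, 2.0]]   # hypothetical B(nu)
b_nu = [4.0, 1.0]                 # hypothetical b(nu)
G, g = cut_terms(pi, B_nu, b_nu)
print(G, g)  # [1.5, 4.0] 4.0
```

Only $G(\nu)$, $g(\nu)$ and $z(\nu)$ travel back to the master, so the reply length depends on the number of first-stage variables and not on the size of the subproblem.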

In our design the subprocessors do not have any information about the status of the
algorithm. The subprocessors set up and solve the subproblems and post-process
the solution. The computation is controlled by the master processor through the
flags $I_x$ and $I_c$.

The master processor runs the entire algorithm except for solving the
subproblems. An important task is controlling the assignment of subproblems
$\omega$ to subprocessors $j$ in the case where more sample subproblems have
to be solved per iteration than there are subprocessors available. Assigning
subproblems in equal proportions to subprocessors is not possible for all sample
sizes, nor is it the most efficient choice, because different subproblems need
different amounts of time to be solved. The solution time mainly depends on how many columns of the
starting basis (from which the solution procedure is started) differ from the optimal
basis of the subproblem. Clearly, it makes sense and is convenient to use as starting
basis the optimal basis of the subproblem which was last solved on the same
processor.

We implemented an algorithm to adaptively balance the work load of the subprocessors.
In our scheme the master processor keeps track of whether a subprocessor $j$ is
busy or idle. At the beginning of each solving phase (preparatory and sample phase)
all subprocessors are idle. The master processor starts a subprocessor working by
sending the first message $(I_x, I_c)$ to it. At this time the subprocessor is set to busy.
It is set to idle again when its solution has arrived at the master processor. Given a
queue of subproblems to be solved, the first subproblem in the queue is assigned to
the next idle subprocessor. The master processor keeps switching between sending
out problems and receiving solutions until all subproblems are solved. Of course
the mapping $\omega \to j$ is not one-to-one because different subproblems $\omega$ are solved by one
subprocessor $j$. However, because we only send a new problem after the solution
of the previous problem has been received, the solution of a subproblem $\omega$ can be
identified as uniquely coming from subprocessor $j$.
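The queue discipline described above can be sketched as a small simulation. This is an illustrative Python model, not the original implementation: the solve times are hypothetical, communication time is ignored, and `dispatch` is a made-up name.

```python
import heapq

def dispatch(durations, num_subs):
    """Simulate assigning queued subproblems to the next idle subprocessor.

    durations[k] is the (hypothetical) solve time of the k-th queued subproblem.
    Returns the list of assigned subprocessors and the duration of the phase.
    """
    # Heap of (time at which the subprocessor becomes idle, subprocessor id).
    idle_at = [(0.0, j) for j in range(1, num_subs + 1)]
    heapq.heapify(idle_at)
    assignment = []
    finish = 0.0
    for d in durations:
        t, j = heapq.heappop(idle_at)    # next subprocessor to become idle
        assignment.append(j)
        done = t + d
        finish = max(finish, done)
        heapq.heappush(idle_at, (done, j))
    return assignment, finish

durations = [3.0, 1.0, 1.0, 1.0, 2.0]
assignment, makespan = dispatch(durations, num_subs=2)
print(assignment, makespan)  # [1, 2, 2, 2, 1] 5.0
```

In this toy instance a fixed alternating assignment would give subprocessor 1 a load of 6.0 time units, whereas sending each subproblem to the next idle subprocessor finishes the phase after 5.0.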

We  can  now  summarize  and  state  the  algorithm.  Step  2  is  computationally  the 
most  expensive  part  and  is  the  part  computed  by  using  parallel  processors. 


The  Algorithm 


Host processor

Step H: 0.0  Load host executable module from disk.
Step H: 0.1  Load master module from disk.
             Send master module to processor 0.
             Load sub module from disk.
             Send sub module to processors $j$, $j = 1, \ldots, J$.
Step H: 0.2  Read data from disk.
             Send control data and stochastic data to processors $j$, $j = 0, \ldots, J$.
             Send master problem data to processor 0.
             Send sub problem data to processors $j$, $j = 1, \ldots, J$.
Step H: 6    Receive optimal solution.
             Write solution report.
             Kill cube. Stop.

Master Processor

Step M: 0    Receive master module from host processor.
             Receive control and stochastic data from host processor.
             Receive master problem data from host processor.
             Initialize: $l = 0$, $UB^0 = \infty$.
Step M: 1    Solve the relaxed master problem.
             Obtain a trial solution $x^l$ and a lower bound $LB^l$.
Step M: 2.0  $l = l + 1$.
Step M: 2.1  Determine preparatory scenarios $\nu^\omega = (\nu_1, \ldots, \nu_h)^\omega$, $\omega = 1, \ldots, n_{prep}$.
Step M: 2.2  $\omega = 1, \ldots, n_{prep}$:
                 Determine $\omega \to j$.
                 Send $I_x^\omega$, $I_c^\omega$ to subprocessor $j$.
                 Send $x^l$ to subprocessor $j$.
                 Send $\nu^\omega$ to subprocessor $j$.
             $\omega = 1, \ldots, n_{prep}$:
                 Receive $z^\omega$ from subprocessor $j$.
                 If $I_c^\omega = 1$: Receive $G^\omega$, $g^\omega$ from subprocessor $j$.
Step M: 2.3  Compute the importance distribution.
Step M: 2.4  Sample scenarios $\nu^\omega = (\nu_1, \ldots, \nu_h)^\omega$, $\omega = 1, \ldots, n$ from the importance
             distribution.
Step M: 2.5  $\omega = 1, \ldots, n$:
                 Determine $\omega \to j$.
                 Send $I_x^\omega$, $I_c^\omega$ to subprocessor $j$.
                 Send $\nu^\omega$ to subprocessor $j$.
             $\omega = 1, \ldots, n$:
                 Receive $z^\omega$ from subprocessor $j$.
                 Receive $G^\omega$, $g^\omega$ from subprocessor $j$.
Step M: 2.6  Obtain estimates of the expected second stage cost, the coefficients and
             the right hand side of the cut. Add the cut to the master problem. Obtain
             an upper bound $UB^l$.
Step M: 3    Solve the master problem.
             Obtain a trial solution $x^l$ and a lower bound $LB^l$.
Step M: 4    $s = UB^l - LB^l + TOL$.
             If $s > 0$ (Student-t test) go to Step 2.
Step M: 5    Obtain a solution and compute confidence interval.
Step M: 6    Send optimal solution to host processor.

Sub Processor j:

Step S: 0    Receive sub module from host processor.
             Receive control and stochastic data from host processor.
             Receive sub problem data from host processor.
Step S: 2.1  Receive $I_x$, $I_c$ from the master processor.
             If $I_x = 1$: Receive $x$ from the master processor.
             Receive $\nu$ from the master processor.
Step S: 2.2  Compute $B(\nu)$, $b(\nu)$ and the right hand side $b(\nu) + B(\nu)x$.
Step S: 2.3  Solve scenario subproblem $\nu$.
Step S: 2.4  Send $z(\nu)$ to the master processor.
Step S: 2.5  If $I_c = 1$:
                 Compute $G(\nu) = \pi(\nu)B(\nu)$, $g(\nu) = \pi(\nu)b(\nu)$.
                 Send $G(\nu)$, $g(\nu)$ to the master processor.
Step S: 2.6  Go to Step S: 2.1
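The subprocessor loop (Steps S: 2.1 to S: 2.6) can be sketched in Python as follows. This is only an illustration under stated assumptions: messages are modelled as dictionaries instead of hypercube messages, and `build_problem` and `solve` are hypothetical callables standing in for the scenario lookup and the LP solver.

```python
def sub_processor(messages, build_problem, solve):
    """Process master messages in order and yield one reply per subproblem.

    messages      : iterable of dicts with keys "Ix", "Ic", "nu" and, if Ix == 1, "x"
    build_problem : hypothetical callable nu -> (b, B) for the scenario
    solve         : hypothetical callable rhs -> (z, pi), a stand-in for the LP solver
    """
    x = None
    for msg in messages:                          # Step S: 2.1
        if msg["Ix"] == 1:
            x = msg["x"]                          # a new master solution arrives
        b, B = build_problem(msg["nu"])           # Step S: 2.2
        rhs = [b_i + sum(B_ij * x_j for B_ij, x_j in zip(row, x))
               for b_i, row in zip(b, B)]
        z, pi = solve(rhs)                        # Step S: 2.3
        reply = {"z": z}                          # Step S: 2.4
        if msg["Ic"] == 1:                        # Step S: 2.5
            reply["G"] = [sum(pi[i] * B[i][j] for i in range(len(pi)))
                          for j in range(len(B[0]))]
            reply["g"] = sum(p_i * b_i for p_i, b_i in zip(pi, b))
        yield reply                               # Step S: 2.6: back to 2.1

# Toy stand-ins: one-row "subproblem" whose value is 2 * rhs with dual 2.
toy_build = lambda nu: ([float(nu[0])], [[1.0, 2.0]])
toy_solve = lambda rhs: (2.0 * rhs[0], [2.0])
msgs = [{"Ix": 1, "Ic": 0, "x": [1.0, 1.0], "nu": [3]},
        {"Ix": 0, "Ic": 1, "nu": [1]}]
replies = list(sub_processor(msgs, toy_build, toy_solve))
print(replies)
```

Note that the subprocessor keeps the last received $x$ between messages, which is why the master needs to send it only once per iteration.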


6.  Performance  Measures 

The main purpose of parallel processing is to reduce computing time relative to
conventional sequential computation. When large sample sizes are
necessary in order to obtain good approximate solutions to stochastic linear programs,
parallel processing is an important part of the solution technique, because
the solution times on sequential computers may exceed practical time limits for
solving the problem.

Assuming that a number $p$ of processors are available and allocated to solve
the problem at hand, we compare the parallel time utilizing $p$ processors to the
sequential time using only 1 processor. We define the parallel time $t_p$ as the duration
from start to finish of the solution process in the parallel implementation. In terms
of CPU times, $t_p$ covers the disjoint union (nonoverlapping total) of the CPU times
of all processors. We define the sequential time $t_s$ as the sum of the CPU times of all
processors. The sequential time $t_s$ differs from the sequential time that would be obtained by actually
solving the problem on one processor. This would require a different implementation
and would not be directly comparable. In a serial version no messages are sent. On
the other hand, computing resources are needed for alternately switching between
solving the master problem and the subproblems.

The speedup $S$ in using $p$ processors instead of one is given by $S = t_s / t_p$. The
efficiency is defined by $E = (S / p) \times 100\%$.

A simple set of algebraic formulae can be used to predict the sequential time $t_s$
and the parallel time $t_p$. We denote by $t_{MA}$ the mean duration to compute the tasks
assigned to the master processor per iteration. We define $t_{SUB}$ as the mean duration
to compute the tasks assigned to a subprocessor (mainly solving one subproblem)
when starting from the optimal solution of the previously solved subproblem, and
$t^0_{SUB}$ as the mean duration if solving a subproblem from scratch. Thus, with $L$ being
the number of iterations,

$$\frac{t_s}{L} = t_{MA} + t^0_{SUB} + (n_{prep} + n)\, t_{SUB}$$

and

$$\frac{t_p}{L} = \begin{cases} t_{MA} + t^0_{SUB} + \dfrac{n_{prep} + n}{p - 1}\, t_{SUB}, & \text{if } n, n_{prep} > p - 1; \\[1ex] t_{MA} + t^0_{SUB} + \left(1 + \dfrac{n}{p - 1}\right) t_{SUB}, & \text{if } n > p - 1,\ n_{prep} < p - 1. \end{cases}$$

If the sample size $n$ is smaller than the number of subprocessors, the parallel algorithm
is not efficient because not all computer resources are utilized. Using the
above formulae, we can compute the efficiency, e.g. for the case of $n, n_{prep} > p - 1$,
as

$$E = \frac{t_s}{p\, t_p} = \frac{t_{MA} + t^0_{SUB} + (n_{prep} + n)\, t_{SUB}}{p\, t_{MA} + p\, t^0_{SUB} + \frac{p}{p - 1}\,(n_{prep} + n)\, t_{SUB}}$$

One can see that for a fixed number of processors the efficiency increases with the
sample size and approaches its limit of $(p - 1)/p$, which is close to 100% for large $p$.
This is because increasing the sample size means
adding computational work which can be conducted in parallel. Thus the parallel
implementation is most efficient when solving problems which require large sample
sizes. On the other hand, one can also see that for a given sample size the efficiency
decreases with an increasing number of processors. The maximum number of
processors which can be utilized meaningfully is $1 + \max\{n_{prep}, n\}$.
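The timing model of this section can be written down as a short Python sketch. The parameter values below are purely illustrative, `t_fixed` stands for the sum $t_{MA} + t^0_{SUB}$, and counting every parallel phase as at least one round of solves is an assumption that covers both cases discussed in the text and extends them to $n < p - 1$.

```python
def time_per_iteration(n, n_prep, p, t_fixed, t_sub):
    """Return (t_s / L, t_p / L) for sample size n and p processors."""
    subs = p - 1                                   # one processor is the master
    ts = t_fixed + (n_prep + n) * t_sub
    # Each phase needs at least one round of subproblem solves (assumption).
    rounds = max(1.0, n_prep / subs) + max(1.0, n / subs)
    tp = t_fixed + rounds * t_sub
    return ts, tp

def efficiency(n, n_prep, p, t_fixed, t_sub):
    """Efficiency E = S / p with speedup S = t_s / t_p."""
    ts, tp = time_per_iteration(n, n_prep, p, t_fixed, t_sub)
    return ts / (p * tp)

# Illustrative parameters only (not measured values).
for n in (100, 300, 600):
    print(n, round(efficiency(n, n_prep=29, p=64, t_fixed=0.1, t_sub=0.015), 3))
print(efficiency(600, n_prep=29, p=2, t_fixed=0.1, t_sub=0.015))
```

With these numbers the efficiency for 64 processors grows with the sample size, while two processors (one master, one subprocessor) give an efficiency of one half regardless of the sample size.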


7.  Numerical  Results 

Experiments  were  conducted  to  validate  the  parallel  implementation  and  to 
obtain  measures  of  computing  time,  speedup  and  efficiency.  Test  problems  taken 
from the literature are usually small with a small number of stochastic parameters.
In order to test our methodology on truly large-scale problems we built two classes
of models based on practical planning models in electric power and financial
investment. Numerical results using the serial implementation of the algorithm are
reported in Infanger (1991) and Dantzig and Infanger (1991). Here we report on
the performance of the parallel algorithm on one of these large-scale test problems.

All experiments were performed on the large-scale test problem BIGNEW,
which is a modified version of the capacity expansion planning model WRPM, a
description of which can be found in Dantzig et al. (1989). It is a multi-area capacity
expansion planning problem for the western USA from the Canadian to the Mexican border. The
model is quite detailed and covers 6 regions, 3 demand blocks, 2 seasons, and several
kinds of generation and transmission technologies. The objective is to determine
optimum discounted least cost levels of generation and transmission facilities for
each region of the system. The model minimizes the total discounted costs of supplying
electricity (investment and operating costs) to meet the exogenously given
demand subject to expansion and operating constraints.

In  the  stochastic  version  of  the  model  the  availabilities  of  generators,  trans¬ 
mission  lines,  and  demands  are  subject  to  uncertainty.  There  are  11  stochastic 
parameters  (8  stochastic  availabilities  of  generators  and  transmission  lines  and  3 
uncertain  demands)  with  discrete  distributions  with  3  or  4  outcomes.  While  other 
implementations  of  WRPM  cover  up  to  3  future  time  periods,  BIGNEW  covers  a 
planning  horizon  of  only  one  future  time  period  and  is  formulated  as  a  two-stage 
stochastic linear program with recourse. The problem is large-scale but is far from
the largest we have solved serially. The number of universe scenarios is about
$10^6$; the equivalent deterministic formulation of the problem (if it were possible to
state it explicitly) would have more than 0.3 billion constraints.

This test problem has been solved repeatedly using different numbers of processors
and different sample sizes. The parallel implementation has been improved
as we learned more about its characteristics. For example, in our first implementation,
tables of indices $\nu^\omega$, $\omega = 1, \ldots, n_{prep}$ and $\nu^\omega$, $\omega = 1, \ldots, n$ were sent to each
sub processor and the sub processors extracted the $\nu^\omega$ from the table when solving
subproblem $\omega$. In this case the table lookups are done in parallel. However, it requires
considerable communication. When sending tables of indices to all processors
the message length in bytes is $n_{prep}$ or $n$ times larger than in an alternative procedure
which sends only $\nu^\omega$ to the corresponding sub processor. Table 1 gives a comparison
of the computing time when sending tables versus sending only index arrays. E.g. a table has
3.8 kBytes whereas an index array has only 60 Bytes. At this stage the mapping of
subproblems to processors $\omega \to j$ was hardwired. Thus the number of subproblems
to be solved in each iteration (number of preparatory subproblems and sample size)
was limited by the number of processors at hand for the computation. The comparison
of the two implementations shows differences in the CPU time which increase
approximately linearly with the sample size. The differences are small, however,
compared to the total CPU time. The comparison shows that communications to
the extent required by this algorithm do not influence the performance significantly.
However, it is clear that extensive communication on some problems could increase
the computing time significantly.

Next we varied the sample size in the range from 20 to 63, where we always
have at least as many processors at hand as subproblems have to be solved in one
parallel phase. Table 2 presents the results. The computing time (measured in
CPU minutes per iteration) is approximately constant at a level of 0.12 minutes per
iteration from sample size 20 up to 29. Then it jumps to a level of approximately
0.17 min per iteration where it again remains approximately constant.

In the test example the number of preparatory subproblems to compute the
importance distribution is 29. Figure 2 shows how the algorithm parallelizes and
indicates the efficiency of the parallel algorithm. The figure shows schematically
busy and idle times for different processors in the case of sample size 63 during the
first two iterations. Note the two phases of solving subproblems, the preparatory
phase and the sample phase. In the preparatory phase only 29 subproblems
have to be solved, compared with 63 subproblems in the sample
phase. Each optimization is started using the basis of the optimal solution of the
problem previously solved on the same processor. At the beginning, all problems
are started from scratch as no basis is available. In the first iteration processors 1
to 29 start from scratch in the preparatory phase but use the optimal bases from
the preparatory subproblems in the sample phase. Processors 30 to 63 do not solve
subproblems in the preparatory phase, thus the sample subproblems assigned to
these processors are started from scratch.

Solving a subproblem from scratch takes considerably more time than solving
it with a good starting basis (warm start). The master processor starts operation
only when all necessary subproblems are solved completely, both in the preparatory
phase and the sampling phase. The computing time in each phase is therefore determined
by the maximum duration spent for solving a subproblem. In the first iteration
processors 30 to 63 are idle during the preparatory phase and solve subproblems
from scratch in the sample phase; the maximum time spent in the sample phase by
these processors is much larger than the maximum time spent by processors 1 to
29. The duration of the sample phase in the first iteration is therefore much larger
in the case of sample sizes larger than 29, the number of preparatory subproblems.
The jump in the computing time at sample size 30 is due to this effect.
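The origin of the jump can be illustrated with a toy calculation in Python. The cold and warm solve times below are hypothetical, and the sketch assumes the hardwired mapping with exactly one sample subproblem per processor.

```python
def sample_phase_time(n, n_prep, t_cold, t_warm):
    """Duration of the first-iteration sample phase, one subproblem per processor.

    Processors 1..n_prep hold a basis from the preparatory phase (warm start);
    processors n_prep+1..n have no basis yet and start from scratch.
    """
    times = [t_warm if j <= n_prep else t_cold for j in range(1, n + 1)]
    return max(times)   # the phase ends when the slowest subproblem is solved

# Hypothetical solve times: 1.0 with a warm start, 3.0 from scratch.
for n in (28, 29, 30, 63):
    print(n, sample_phase_time(n, n_prep=29, t_cold=3.0, t_warm=1.0))
```

As soon as a single processor has to start from scratch, the whole phase takes as long as a cold start, which is why the measured time per iteration steps up between sample sizes 29 and 30.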

Besides the impact of the starting basis in the first iteration, there is also an
impact in all other iterations. A basis of the optimal solution of a subproblem of the
current iteration is expected to be a better starting basis than a basis of the optimal
solution of a subproblem of the previous iteration. Note that the effect only occurs
if $n_{prep} < p - 1$ and $n > n_{prep}$. We overcome this effect by supplying a proper
basis to subprocessors 30 to 63. In general one could copy the optimal basis of the
subproblem which has finished first in the preparatory phase to processors 30 to 63
to warm start all subs in the sample phase. As idle processors are not used for
any other tasks and cannot be used in a timesharing mode by other users, it is more
efficient (as no communication is necessary) to assign a preparatory subproblem
(e.g. subproblem 1) also to processors 30 to 63 and solve it to have an optimal
starting basis ready for the sampling phase. Table 2 also shows the results for
warm starting all subs. The computing time remains approximately constant over
the whole range of sample sizes. The results show that the effect is completely
compensated. When using the warm start feature no time differences resulting
from $n_{prep} < n$ can be observed. Thus the model for determining the parallel time
$t_p$ is valid for all numbers of preparatory problems $n_{prep}$.

The analysis so far has concerned previous implementations where the assignment
of subproblems to subprocessors was hardwired. In our current implementation
subproblems are sent to the next idle node. This implementation allows
for any numbers of subproblems $n_{prep}$ and $n$ per iteration and distributes the
subproblems efficiently over the processors available. If necessary the
warm start procedure is used. In the following we are interested in the efficiency of
the method both with respect to the sample size and with respect to the number
of processors.

For determining the efficiency we use the formulae developed in
Section 6. Varying the sample size over a sufficiently large range, we estimate the
parameters for determining the computing time. Table 3 gives the results for sample
sizes from 100 to 600 using 64 processors, showing the parallel computation time
versus the sample size for both the actual time measurements and the estimates from
the formulae. One can see that the algebraic formulae give an excellent estimate
of the actual parallel computing time. We estimate the parameters $t_{MA} + t^0_{SUB}$
to be 0.0962 and $t_{SUB}$ to be 0.0149. Using these parameters we compute the
corresponding serial time $t_s$, the speedup $S$ and the efficiency $E$, which are also
reported in Table 3. While the efficiency is low for small sample sizes, it rapidly
improves with increasing sample size. In the case of sample size 600, we obtain a
speedup of about 37.5, which means that using 64 processors we reduce the computation
time by a factor of 37.5. The total parallel time is 17.3 minutes while in a serial
implementation the time to solve the problem would be 652 minutes. Figure 3
shows the dependency of the efficiency upon the sample size when 64 processors are
used.

Using  estimates  based  on  the  formulae  for  the  parallel  time,  we  compute  the 
efficiency  as  a  function  of  the  number  of  parallel  processors  used.  Figure  4  gives  a 
graphical  representation.  For  small  numbers  of  processors  the  effect  of  only  p  —  1 
processors  operating  in  parallel  when  using  p  processors  dominates  the  result.  For 
example  when  using  2  processors  we  switch  between  the  master  processor  and  only 
one  sub  processor.  There  is  no  parallel  overlapping  in  the  computation.  In  this  case 
we  perform  a  serial  computation  distributed  to  2  processors.  The  efficiency  hence 
is  50%.  The  efficiency  increases  until  the  above  mentioned  effect  is  not  dominating 
anymore.  E.g.  for  sample  size  600  and  using  12  processors,  the  efficiency  is  about 
82%.  The  efficiency  decreases  with  increasing  numbers  of  processors  beyond  12. 
Using  64  processors,  we  obtain  an  efficiency  of  58.54%  when  sample  size  is  600. 
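The dependence on the number of processors can be explored with a short Python sketch based on the formulae of Section 6 and the parameter estimates quoted above (0.0962 for $t_{MA} + t^0_{SUB}$ and 0.0149 for $t_{SUB}$, with 29 preparatory subproblems). Treating every parallel phase as at least one round of solves is an assumption of the sketch, and the function name is hypothetical.

```python
def efficiency(n, n_prep, p, t_fixed, t_sub):
    """Model efficiency E = t_s / (p * t_p) per iteration; t_fixed = t_MA + t0_SUB."""
    subs = p - 1                                   # one processor is the master
    ts = t_fixed + (n_prep + n) * t_sub
    rounds = max(1.0, n_prep / subs) + max(1.0, n / subs)   # assumption, see text
    tp = t_fixed + rounds * t_sub
    return ts / (p * tp)

T_FIXED, T_SUB, N_PREP, N = 0.0962, 0.0149, 29, 600
table = {p: efficiency(N, N_PREP, p, T_FIXED, T_SUB) for p in range(2, 65)}
for p in (2, 4, 8, 12, 16, 32, 64):
    print(p, f"{100 * table[p]:.1f}%")
```

Under these assumptions the model gives 50% for two processors, a maximum in the low eighties for roughly ten to twelve processors, and a value between 58% and 59% for 64 processors, in line with the figures discussed in the text.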

Corresponding to the runs documented in Table 3, Table 4 reports the optimum
objective function value and the 95% confidence interval. The lower bound
distributions have less variance than the upper bound distributions, hence the confidence
interval is asymmetric. Using a sample size of 100 (out of about 1 million
universe scenarios) we obtain an optimal solution of 188348.7 with a 95% confidence
interval of 0.08% on the lower side and 0.18% on the upper side. Even with only
small sample sizes we obtain remarkably accurate results. The parallel time to run
the problem was 8.3 minutes on the Hypercube multicomputer.

The optimal objective function value remains stable when increasing the sample
size. That again shows that we obtained good estimates. The confidence interval
decreases with increasing sample size and the rate of $n^{-0.5}$ is verified by the computational
results.
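A rough check of this rate can be made from the total interval widths reported in Table 4. The Python sketch below fits a straight line to the logarithm of the width against the logarithm of the sample size, using only the rows for sample sizes 400, 500 and 600; with so few points it is an illustration, not a statistical test.

```python
import math

# (sample size n, total width of the 95% confidence interval) from Table 4.
rows = [(400, 233.5), (500, 209.0), (600, 196.3)]

def loglog_slope(points):
    """Least-squares slope of log(width) versus log(n)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(w) for _, w in points]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

slope = loglog_slope(rows)
print(round(slope, 2))  # to be compared with the theoretical exponent -0.5
```

The fitted exponent is negative and of the same order as $-0.5$, which is what one would expect for an interval width that shrinks like $n^{-0.5}$.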

Using  a  sample  size  of  600  we  obtain  an  optimal  objective  function  value  of 
188351.8  with  a  95%  confidence  interval  of  0.04%  on  the  left  side  and  of  0.06% 
on  the  right  side.  Thus  the  optimal  solution  lies  with  95%  confidence  within 
$188276.7 \le z^* \le 188473.0$. All solutions reported in Table 4 fall within this range.
The computation time on the Hypercube multicomputer was 17.3 minutes. It is
interesting to note that during the process of solving the problem about 43400 subproblems
of the size of 289 rows and 302 columns each and 69 master problems were
solved.

Table 1: Communication time


nodes      [header                  CPU (min)        CPU (min)
reserved   illegible]   n    iter   sending tables   sending indices   diff          obj

   32         32        20    66        7.916            7.898         [illegible]   188482
   32         32        30    64       11.236           11.173         [illegible]   188271
   64         51        50    63       11.118         [illegible]      [illegible]   188378
   64         51        60    70       12.167           12.035         [illegible]   188549

Table  2:  Warm  start  all  subs 


                                    with no warm start      with a warm start
nodes      [header                  CPU                     CPU
reserved   illegible]   n    iter   (min)     time/it       (min)     time/it      obj

   32         32        20    66     7.898    [illegible]    7.898     0.120       188382
   32         32        24    63     7.502    [illegible]    7.502     0.119       188025
   32         32        26    61     7.866    [illegible]    7.866     0.129       188236
   32         32        27    56     6.219     0.111         6.319     0.111       188232
   32         32        28    52     6.434     0.124         6.434     0.124       188195
   32         32        29    60     7.303     0.122         7.303     0.122       188492
   32         32        30    64    11.173     0.175         7.770     0.121       188271
   32         32        31    60    10.767     0.179         7.331     0.122       188301
   64         33        32    64    12.409     0.194         N/A       N/A         188347
   64         36        35    59    10.334     0.175         N/A       N/A         188295
   64         41        40    63    10.898     0.173         7.516     0.119       188261
   64         51        50    63    11.034     0.175         7.528     0.119       188378
   64         61        60    70    12.035     0.172         8.374     0.120       188549
   64         64        63    75    12.645     0.169         8.821     0.118       188492

Table  3  :  Speedup  and  Efficiency 


              t_p (min/iter)               t_s          S           E
  n    iter   actual   est. by formula     (min/iter)   speedup     efficiency (%)

 100    63    0.132    [illegible]         2.024        14.99       23.456
 200    72    0.159    [illegible]         3.519        22.13       34.674
 300    76    0.182    [illegible]         5.014        27.55       42.973
 400    84    0.213    [illegible]         6.508        31.59       49.360
 500    69    0.229    [illegible]         8.003        34.80       54.428
 600    69    0.250    0.253               9.497        37.54       58.547

Table  4:  Optimal  Solution 


                          95% confidence interval
                                                  total            total,
  n    iter   obj         lower         upper     (lower+upper)    % of obj      time (min)

 100    63    188348.7    [illegible]   344.4     497.4            [illegible]     8.3
 200    72    188390.9    144.8         161.8     306.6            [illegible]    11.4
 300    76    188344.9    [illegible]   180.2     280.7            0.15           13.8
 400    84    188328.4     79.9         153.7     233.5            0.12           17.9
 500    69    188304.0     78.0         131.1     209.0            0.11           15.8
 600    69    188351.8     75.1         121.2     196.3            0.10           17.3


Figure  1.  Hypercubes  of  dimension  n  <  4 


Figure  2.  Efficiency  of  the  parallel  implementation 


Figure 3. Efficiency versus sample size